
[serve][llm][transcription] Add support for Transcription in vLLM engine backend #57194

Merged
kouroshHakha merged 41 commits into ray-project:master from Blaze-DSP:master
Oct 24, 2025

Conversation

@Blaze-DSP
Contributor

@Blaze-DSP Blaze-DSP commented Oct 4, 2025

Why are these changes needed?

Expose a transcriptions API, like https://platform.openai.com/docs/api-reference/audio, using vLLM.
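For context, here is a minimal client sketch of what the new endpoint enables, using the OpenAI Python SDK against a Ray Serve deployment; the base URL, API key, and whisper-tiny model id are placeholders rather than anything defined in this PR:

from openai import OpenAI

# Point the client at the Ray Serve OpenAI-compatible router; the base URL,
# API key, and model id below are placeholder values for this sketch.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="fake-key")

with open("sample.wav", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="whisper-tiny",  # hypothetical model id served by the deployment
        file=audio_file,
    )

print(transcription.text)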

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run pre-commit jobs to lint the changes in this PR. (pre-commit setup)
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@Blaze-DSP Blaze-DSP requested review from a team as code owners October 4, 2025 20:49
@Blaze-DSP Blaze-DSP requested a review from a team as a code owner October 4, 2025 20:50
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a new transcription API, following the OpenAI specification. The changes are well-structured, touching the necessary model definitions, LLM server, vLLM engine, and router components. The implementation largely follows existing patterns in the codebase. However, I've identified a couple of critical issues that would cause runtime errors, such as a missing comma in a type hint and a method name mismatch between the server and the engine. There are also some minor maintainability issues like a copy-pasted comment and a typo in a docstring. Addressing these points will make the PR ready for merging.

Contributor

Copilot AI left a comment

Pull Request Overview

This PR introduces support for a transcription API to vLLM's OpenAI-compatible interface, following the OpenAI audio/transcriptions API specification. The implementation adds the necessary request/response models, router endpoints, and engine integration to handle audio transcription requests.

  • Adds TranscriptionRequest, TranscriptionResponse, and TranscriptionStreamResponse models
  • Implements /v1/audio/transcriptions endpoint in the router (see the request sketch after this list)
  • Integrates transcription support into the vLLM engine with proper error handling
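As a rough illustration of the endpoint shape, a hedged sketch of the raw multipart request the router accepts per the OpenAI audio/transcriptions spec; the URL and model id are placeholders:

import requests

# Multipart POST against the new endpoint; URL and model id are placeholder values.
with open("sample.wav", "rb") as f:
    resp = requests.post(
        "http://localhost:8000/v1/audio/transcriptions",
        files={"file": ("sample.wav", f, "audio/wav")},
        data={"model": "whisper-tiny"},
    )

print(resp.status_code, resp.json())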

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

Summary per file:
python/ray/serve/llm/openai_api_models.py: Adds public API models for transcription request/response types
python/ray/llm/_internal/serve/deployments/routers/router.py: Implements transcription endpoint and updates request processing logic
python/ray/llm/_internal/serve/deployments/llm/vllm/vllm_engine.py: Adds transcription engine integration with vLLM OpenAI serving
python/ray/llm/_internal/serve/deployments/llm/llm_server.py: Adds transcription method to LLM server with async generator interface
python/ray/llm/_internal/serve/configs/openai_api_models.py: Defines internal transcription models and response type unions

Contributor

@kouroshHakha kouroshHakha left a comment

Nice. I think the basic feature looks good. We need to just add CI tests and some release tests as well.

For CI please take a look at existing tests for the endpoints at engine and router levels. Here are some I found:

You would need to create a mock engine with some reasonable transcription behavior.

Let's keep the translation for another PR after we cover everything for this new endpoint.

For release test, could you share the serve run script that you used to validate the behavior along with the client code and expected output. We can turn that into a gpu release test with a real model (maybe using whisper-tiny, etc) so that it is continuously tested.
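For concreteness, a minimal sketch of the kind of serve script being requested, assuming the ray.serve.llm LLMConfig/build_openai_app APIs used in this PR's docs example and a whisper-tiny model; the script the author actually ran is not shown in this thread:

# serve_transcription.py -- sketch only; model ids and config values are illustrative.
from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

whisper_llm_config = LLMConfig(
    model_loading_config={
        "model_id": "whisper-tiny",
        "model_source": "openai/whisper-tiny",
    },
    deployment_config={
        "autoscaling_config": {"min_replicas": 1, "max_replicas": 1},
    },
)

app = build_openai_app({"llm_configs": [whisper_llm_config]})
serve.run(app, blocking=True)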

@ray-gardener ray-gardener bot added the serve (Ray Serve Related Issue), docs (An issue or change related to documentation), llm, and community-contribution (Contributed by the community) labels Oct 5, 2025
@Blaze-DSP Blaze-DSP requested a review from a team as a code owner October 7, 2025 20:32
@Blaze-DSP
Contributor Author

@kouroshHakha CI tests have been written and the docs have also been updated. Please check and verify.

If we are going to adopt vllm==0.11.0, then the v0 engine has been entirely deprecated and all models are supported via the v1 engine. We'll need to make the appropriate changes to the docs, etc. (e.g., for embeddings).

Contributor

@kouroshHakha kouroshHakha left a comment

Adding @eicherseiji for review.

Contributor

@eicherseiji eicherseiji left a comment

Recommend pip install pre-commit && pre-commit install before a lint commit to satisfy the CI.

For a release test, recommend def test_llm_serve_correctness( and - name: llm_serve_correctness for examples.

Looks like we're in pretty good shape though. Just a few comments + release test and we should be good. Thanks!

@cursor

cursor bot commented Oct 23, 2025

Bug: Batched Response Type Handling Issue

The condition isinstance(first_chunk, NON_STREAMING_RESPONSE_TYPES) checks whether first_chunk is an instance of a tuple of types, and NON_STREAMING_RESPONSE_TYPES is defined as a tuple containing ChatCompletionResponse, CompletionResponse, and TranscriptionResponse. However, when batching is enabled, first_chunk could be extracted from a list (lines 538-541), and the isinstance check should handle both the direct response object and the case where it's wrapped in a list. The current logic may incorrectly identify a non-streaming response when the first item in a batched list happens to be one of these types, even though the overall response is streaming.
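A minimal sketch of the unwrapping the bot is describing, written as a standalone helper with hypothetical names (the actual router variables are first_chunk and NON_STREAMING_RESPONSE_TYPES):

def is_non_streaming_chunk(first_chunk, non_streaming_types) -> bool:
    # If batching wrapped the chunk in a list, inspect the first element;
    # otherwise check the chunk itself against the tuple of response types.
    candidate = first_chunk[0] if isinstance(first_chunk, list) and first_chunk else first_chunk
    return isinstance(candidate, non_streaming_types)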


@cursor

cursor bot commented Oct 23, 2025

Bug: File Stream Exhaustion in Transcription Method

The transcriptions method consumes the request.file stream when reading audio data. This prevents subsequent operations, such as retries or logging, from accessing the file content.
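A minimal sketch of one way to avoid the exhaustion, assuming a FastAPI/Starlette UploadFile as the comment implies: read the stream once, keep the bytes, and rewind it for any later consumer.

from fastapi import UploadFile


async def read_audio_once(file: UploadFile) -> bytes:
    # Read the upload a single time, then rewind so retries, logging, or the
    # engine can still read the same file object afterwards.
    audio_data = await file.read()
    await file.seek(0)
    return audio_data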


@cursor

cursor bot commented Oct 23, 2025

Bug: Streaming Response Type Inconsistency

The type annotations for LLMChatResponse, LLMCompletionsResponse, and LLMTranscriptionResponse are inconsistent regarding streaming responses. They currently include both str (for SSE format) and specific *StreamResponse objects, which creates ambiguity about the actual type yielded during streaming.
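To make the split concrete, a hedged sketch of narrower aliases; the response classes here are stand-ins so the snippet is self-contained, not the PR's actual definitions:

from typing import AsyncGenerator, Union


class TranscriptionResponse: ...  # stand-in for the PR's model


class ErrorResponse: ...  # stand-in for the PR's model


# SSE-formatted strings appear only in the streaming alias; full response objects
# appear only in the non-streaming alias, removing the ambiguity described above.
LLMTranscriptionStreamResponse = AsyncGenerator[str, None]
LLMTranscriptionResponse = AsyncGenerator[Union[TranscriptionResponse, ErrorResponse], None]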


@cursor

cursor bot commented Oct 23, 2025

Bug: request_id Field Added to TranscriptionRequest

The request_id field is added to TranscriptionRequest with a default factory that generates a random UUID. However, based on the PR discussion comment "@kouroshHakha: do we need request_id here?", this field may not be needed. More critically, looking at the context files, CompletionRequest and EmbeddingCompletionRequest have similar request_id fields with TODO comments indicating they should be upstreamed to vLLM. The issue is that TranscriptionRequest inherits from vLLMTranscriptionRequest, and if vLLM's base class doesn't have this field, adding it here could cause serialization/deserialization issues when the request is passed to vLLM's engine, similar to the Pydantic ValidatorIterator issue mentioned in the code comments. The field should either be removed or the same TODO comment should be added indicating it needs to be upstreamed.
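For reference, a self-contained sketch of the field plus the TODO marker being asked for; the base class and description wording are stand-ins, not the PR's exact code:

from uuid import uuid4

from pydantic import BaseModel, Field


class TranscriptionRequest(BaseModel):  # stand-in base class for this sketch
    # TODO (as suggested above): upstream this field to vLLM's TranscriptionRequest,
    # mirroring CompletionRequest / EmbeddingCompletionRequest.
    request_id: str = Field(
        default_factory=lambda: uuid4().hex,
        description=(
            "The request_id related to this request. If not set, a random UUID is "
            "generated and used throughout the inference process."
        ),
    )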


@kouroshHakha
Contributor

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a new transcription API, which is a fantastic addition to the LLM serving capabilities. The implementation is well-structured, with changes spanning from the public API models to the vLLM engine, and includes comprehensive tests. The refactoring in ingress.py to generalize request processing is a notable improvement for maintainability. I have a few suggestions regarding potential memory usage with large audio files, improving the clarity of the documentation example, and fixing a broken link in a TODO comment. Overall, this is a solid contribution.

Comment on lines 59 to 80
voxtral_llm_config = LLMConfig(
    model_loading_config={
        "model_id": "voxtral-mini",
        "model_source": "mistralai/Voxtral-Mini-3B-2507",
    },
    deployment_config={
        "autoscaling_config": {
            "min_replicas": 1,
            "max_replicas": 2,
        }
    },
    accelerator_type="A10G",
    # You can customize the engine arguments (e.g. vLLM engine kwargs)
    engine_kwargs={
        "tokenizer_mode": "mistral",
        "config_format": "mistral",
        "load_format": "mistral",
    },
    log_engine_metrics=True,
)

app = build_openai_app({"llm_configs": [whisper_llm_config, voxtral_llm_config]})
Contributor

medium

This example, which is focused on transcriptions, also includes the configuration for a second, unrelated model (voxtral-mini). This could be confusing for users who are looking for a minimal, focused example on how to set up a transcription service.

For better clarity and to make the example easier to copy and adapt, I recommend removing the voxtral_llm_config and updating the app creation on line 80 to only use the whisper_llm_config, like this:

app = build_openai_app({"llm_configs": [whisper_llm_config]})

This will make the example more direct and easier to follow for the specific use case of transcriptions.

Contributor

I agree. Should we just keep voxtral_llm_config, which also shows the engine_kwargs?

raw_request = self._create_raw_request(request, "/audio/transcriptions")

# Extract audio data from the request file
audio_data = await request.file.read()
Contributor

medium

Reading the entire audio file into memory with await request.file.read() could lead to high memory consumption, especially with large audio files (the OpenAI API limit is 25MB) and concurrent requests. This might risk Out-Of-Memory errors on the replica.

If the underlying create_transcription method in vLLM supports it, consider streaming the file content instead of reading it all at once. This could be done by passing a file-like object or an async generator. If vLLM requires the full byte string, this is an acceptable limitation, but it's an important performance consideration to be aware of.

Contributor

@kouroshHakha kouroshHakha left a comment

Nice. Looks great. Just a few nits before wrapping up this PR:

"async-timeout; python_version < '3.11'",
"typer",
"meson",
"pybind11",
Contributor

@eicherseiji a chore we should do is make this part of setup.py read llm-requirements.txt directly, so we only update one source of truth down the line.
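A minimal sketch of what that single source of truth could look like; the path comes from this thread, and the filtering rules are an assumption:

def read_llm_requirements(path: str = "python/requirements/llm/llm-requirements.txt"):
    # Parse the requirements file so setup.py and the requirements file cannot drift apart.
    with open(path) as f:
        return [
            line.strip()
            for line in f
            if line.strip() and not line.strip().startswith("#")
        ]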

Comment on lines 59 to 80
voxtral_llm_config = LLMConfig(
    model_loading_config={
        "model_id": "voxtral-mini",
        "model_source": "mistralai/Voxtral-Mini-3B-2507",
    },
    deployment_config={
        "autoscaling_config": {
            "min_replicas": 1,
            "max_replicas": 2,
        }
    },
    accelerator_type="A10G",
    # You can customize the engine arguments (e.g. vLLM engine kwargs)
    engine_kwargs={
        "tokenizer_mode": "mistral",
        "config_format": "mistral",
        "load_format": "mistral",
    },
    log_engine_metrics=True,
)

app = build_openai_app({"llm_configs": [whisper_llm_config, voxtral_llm_config]})
Contributor

I agree. Should we just keep voxtral_llm_config, which also shows the engine_kwargs?

"""
pass

@abc.abstractmethod
Contributor

I think we should make this method non-abstract so subclasses can skip implementing it, and raise a NotImplementedError here. This point actually applies to all the endpoints, but it just came to my mind now :)
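A hedged sketch of that shape, with a simplified stand-in for the engine base class:

class LLMEngineBase:  # simplified stand-in; the real base class lives in this PR
    async def transcriptions(self, request):
        # Concrete default instead of @abc.abstractmethod, so engines that do not
        # serve transcriptions can simply skip overriding this method.
        raise NotImplementedError(
            f"{type(self).__name__} does not implement the transcriptions endpoint."
        )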

@kouroshHakha
Contributor

making sure release tests pass: https://buildkite.com/ray-project/release/builds/65185

@kouroshHakha kouroshHakha changed the title from Transcription API to [serve][llm][transcription] Add support for Transcription in vLLM engine backend Oct 23, 2025
Blaze-DSP and others added 2 commits October 24, 2025 00:20
Signed-off-by: DPatel_7 <dpatel@gocommotion.com>
@Blaze-DSP
Contributor Author

@kouroshHakha I've made the appropriate changes from the review.

@eicherseiji
Contributor

Requested stamp from @richardliaw

@kouroshHakha
Contributor

@Blaze-DSP the GPU test on the example is failing now (probably due to changing to a different model?). Can you take a look? It has a GPU memory problem.

Premerge Build 52450 LLM GPU Tests.log

DPatel_7 and others added 2 commits October 24, 2025 12:22
Signed-off-by: DPatel_7 <dpatel@gocommotion.com>
AsyncGenerator[
Union[str, ChatCompletionStreamResponse, ChatCompletionResponse, ErrorResponse],
None,
],

Bug: Type Mismatch in Streaming Responses

The LLM*Response type annotations (e.g., LLMChatResponse, LLMCompletionsResponse, LLMTranscriptionResponse) include stream response object types. However, streaming endpoints ultimately yield SSE-formatted strings, not these objects. This creates a mismatch between the declared types and the actual runtime output, which can lead to type checking issues and confusion.


"not set it, a random_uuid will be generated. This id is used "
"through out the inference process and return in response."
),
)

Bug: Request ID Conflict in Transcription Classes

The TranscriptionRequest class explicitly adds a request_id field. This could conflict with vLLMTranscriptionRequest if the base class already defines it, potentially causing duplication or unexpected behavior. This also differs from other request types that inherit request_id from their vLLM base classes.


set(
[
"vllm>=0.11.0",
"vllm[audio]>=0.11.0",

Bug: Unnecessary Audio Dependencies in Ray[LLM]

The ray[llm] extra now unconditionally pulls in vllm[audio], adding audio-related dependencies. This increases the dependency footprint for all ray[llm] users, even those not needing transcription features, making the installation heavier than necessary.


"async-timeout; python_version < '3.11'",
"typer",
"meson",
"pybind11",

Bug: Setup.py and Requirements.txt Mismatch

The setup.py adds "meson" and "pybind11" to the llm extras, but these dependencies are not present in the corresponding llm-requirements.txt file (lines 14-15 in the requirements file). While the comment on line 372 states "Keep this in sync with python/requirements/llm/llm-requirements.txt", these two packages are only added to setup.py and not to the requirements file, creating an inconsistency between the two dependency specifications.


raw_request = self._create_raw_request(request, "/audio/transcriptions")

# Extract audio data from the request file
audio_data = await request.file.read()

Bug: Audio File Reading and Pointer Issue

The code calls await request.file.read() to extract audio data from the request file. However, this reads the entire file into memory, which could be problematic for large audio files. Additionally, after reading the file once, the file pointer is at the end, so if vLLM's create_transcription method tries to read from request.file again, it will get empty data. The audio_data is extracted but the original request.file is still passed to create_transcription, which may cause issues if vLLM expects an unread file object.


Signed-off-by: DPatel_7 <dpatel@gocommotion.com>
"not set it, a random_uuid will be generated. This id is used "
"through out the inference process and return in response."
),
)

Bug: UUID Inconsistency Causes vLLM Compatibility Issues

The TranscriptionRequest adds a request_id with a default UUID, which is inconsistent with other request types that mark this field for upstreaming to vLLM. This approach, questioned in a prior discussion, may cause compatibility issues with vLLM's backend.


Union[CompletionStreamResponse, CompletionResponse, ErrorResponse], None
Union[str, CompletionStreamResponse, CompletionResponse, ErrorResponse], None
],
]

Bug: Type Mismatch in Streaming Responses

The LLMChatResponse and LLMCompletionsResponse type annotations include *StreamResponse objects, but the engine's streaming responses actually yield raw strings. This creates a mismatch between the declared types and the actual return values.


# vLLM implementation for handling transcription requests: https://github.com/vllm-project/vllm/blob/0825197bee8dea547f2ab25f48afd8aea0cd2578/vllm/entrypoints/openai/api_server.py#L839.
async def transcriptions(
self, body: Annotated[TranscriptionRequest, Form()]
) -> Response:

Bug: Audio File Serialization Issue in Ray Serve

The transcriptions endpoint uses Annotated[TranscriptionRequest, Form()] to handle form data, but the TranscriptionRequest object contains a file field that is an UploadFile. When this is passed through Ray Serve's remote call mechanism (in _get_response), the UploadFile object may not be serializable for pickling, which is required for Ray remote calls. This could cause serialization errors when the request is sent to the model deployment. The audio data should be extracted and passed as bytes before the remote call, similar to what's done in the vLLM engine at line 473.
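A hedged sketch of the direction the bot suggests, assuming a Pydantic v2 request model; the payload field name and helper are illustrative, not the PR's actual code:

async def to_picklable_payload(body) -> dict:
    # Materialize the upload as bytes on the router side so only picklable data
    # crosses the Ray remote-call boundary.
    audio_bytes = await body.file.read()
    payload = body.model_dump(exclude={"file"})  # drop the non-picklable UploadFile
    payload["audio_bytes"] = audio_bytes  # hypothetical field consumed by the deployment
    return payload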


@Blaze-DSP
Contributor Author

My bad, fixed the issue @kouroshHakha.

@eicherseiji
Contributor

Release test llm_batch_vllm_multi_node is jailed on master, so its failure can be ignored: https://buildkite.com/ray-project/release/builds/65341#019a15f2-6b2e-474b-9ded-e4ed4d9fe246

All other release tests passing

@kouroshHakha kouroshHakha merged commit ca1f7d9 into ray-project:master Oct 24, 2025
5 of 6 checks passed
xinyuangui2 pushed a commit to xinyuangui2/ray that referenced this pull request Oct 27, 2025
…ine backend (ray-project#57194)

Signed-off-by: DPatel_7 <dpatel@gocommotion.com>
Co-authored-by: DPatel_7 <dpatel@gocommotion.com>
Signed-off-by: xgui <xgui@anyscale.com>
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
…ine backend (ray-project#57194)

Signed-off-by: DPatel_7 <dpatel@gocommotion.com>
Co-authored-by: DPatel_7 <dpatel@gocommotion.com>
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025
…ine backend (ray-project#57194)

Signed-off-by: DPatel_7 <dpatel@gocommotion.com>
Co-authored-by: DPatel_7 <dpatel@gocommotion.com>
Signed-off-by: Aydin Abiar <aydin@anyscale.com>
Future-Outlier pushed a commit to Future-Outlier/ray that referenced this pull request Dec 7, 2025
…ine backend (ray-project#57194)

Signed-off-by: DPatel_7 <dpatel@gocommotion.com>
Co-authored-by: DPatel_7 <dpatel@gocommotion.com>
Signed-off-by: Future-Outlier <eric901201@gmail.com>